Skip to content

[Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp#264

Merged
YWHyuk merged 3 commits into
feature/tog-python-bindingfrom
fix/review-floormod-decompose
Jun 18, 2026
Merged

[Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp#264
YWHyuk merged 3 commits into
feature/tog-python-bindingfrom
fix/review-floormod-decompose

Conversation

@YWHyuk

@YWHyuk YWHyuk commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Fixes from the max-effort code review of #259, each verified against the C++ reference and reachability. Squashed/folded with the former #265 (SDPA vcix fix).

Correctness

  • get_dma_info: raise on unlinearized floor/mod (store-side ModularIndexing, reduction-axis floor/mod) instead of silently mis-striding ([A]).
  • decompose_transfer: pick the non-unit dim per collapse group (was a trailing unit dim); >4D peel raises when the vlane axis is peeled into the outer loop ([B][C]).
  • graph_copy: per-dim max extent for the consumer iteration shape (was rank-only, an order-dependent silent miss of an incompatible-radix conflict) ([I]).
  • lower_to_vcix _lower_matmul: lower a fused matmul whose operand is produced in-place by affine.vector_store (SDPA scores·V), not just DMAed -- was left un-lowered -> wrong attention output. Mirrors the C++ isAInitialized/isBInitialized.
  • lower_to_vcix _make_sf_vc_v_iv: fix the ExtractStridedSliceOp arg order/types so the n>1 transcendental chunk (SDPA softmax exp) stops crashing.

Hardening / parity / docs

  • lower_to_vcix: F16/BF16 sew=0 + rank!=1 guard (C++ parity, fp16 transcendentals stay on -convert-math-to-llvm) ([E]); matmul M/N/K-multiple-of-SS and A/B subtileK guards as loud failures ([F]).
  • graph_copy default-on docstrings corrected ([H]).

Refuted (no change): dma_fine_grained bias tag map (Python writer/wait mutually consistent), build_tog loop-stride dedup (no duplicate keys emitted).

Validated end-to-end (Spike+TOGSim allclose): elementwise, gemm, bmm, conv2d, group_conv, pool, cat, floor/mod suite (incl. pixel_shuffle), reduce, softmax, layernorm, batchnorm, gqa, SDPA (was crash/wrong) -- all pass. SDPA bisected: C++ vcix passes, Python vcix did not; localized to _lower_matmul.

YWHyuk and others added 3 commits June 18, 2026 14:25
…ases

Three fixes from the max-effort review of this branch:

- get_dma_info: after retiring the floor/mod recompile branches, a residual
  floor/mod (store-side ModularIndexing, reduction-axis floor/mod, incompatible
  radix) that axis-split/graph-copy did not linearize was silently bucketed by
  its base symbol in the dram_stride loop, emitting a wrong DRAM descriptor.
  Raise NotImplementedError instead of mis-striding silently. No test triggers
  it (0 floor/mod reach get_dma_info in the suite) -- it is a safety net.

- decompose_transfer collapse fast path: keep=[g[-1]] picked the last dim of
  each reassociation group, which is a unit dim when trailing unit dims attach
  after the non-unit one (e.g. [..,4,1,1]); strides/subtile were read from the
  wrong axis. Pick the non-unit dim in each group.

- decompose_transfer >4D peel: new_vlane fell back to 0 whenever the vlane split
  axis was not among the inner 4 dims, conflating peeled-into-the-outer-loop
  (genuinely unrepresentable -> raise) with a unit lane axis (default 0 is fine).

Validated: elementwise, gemm, conv2d, cat, floor/mod suite (incl. pixel_shuffle
>4D peel), softmax, layernorm, batchnorm -- all pass, no spurious raise.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
More fixes from the max-effort review, after verifying each against the C++
reference and reachability:

- graph_copy _relayout_args: ranges picked the consumer iteration shape by rank
  alone (max key=len), so for two equal-rank operands with different per-dim
  extents the broadcast-from operand's smaller shape could win and the real
  incompatible-radix conflict on the broadcast-to dim was missed (order-dependent:
  a commutative reorder flipped correct relayout into a silent miss). Use per-dim
  max extent over the max-rank operands.

- lower_to_vcix _sew/_legalize_vector_type: mirror the C++ legalizeVectorType --
  F16/BF16 return sew 0 (transcendentals stay unlowered for -convert-math-to-llvm,
  as in the validated path) instead of being lowered to VCIX, and add the missing
  rank != 1 guard.

- lower_to_vcix matmul: port the C++ guards as loud failures -- M/N/K must be a
  multiple of the systolic size when > SS (else the N//SS / K//SS loops drop the
  tail tile), and A vs B must agree on the K subtile (last-writer-wins would pick
  one silently). Latent today (heuristic/autotune only emit SS-multiple tiles).

- Doc-only: graph-copy is default-on (TORCHSIM_GRAPH_COPY=0 to disable); fixed the
  two stale 'no-op unless set' comments.

Validated: elementwise, gemm, bmm, conv2d, group_conv, cat, floor/mod suite,
softmax, layernorm -- all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fix exp chunk

Two fixes to the C++->Python vcix port (lower_to_vcix.py) that SDPA exercises but
the gemm/bmm/conv tests do not:

- _lower_matmul bailed with 'if ATag is None or BTag is None: return False',
  gating on an MVIN dma_start tag for both operands. In SDPA's fused scores.V
  matmul, operand B is the softmax output produced in place by affine.vector_store,
  not DMAed, so BTag stayed None and the matmul was left un-lowered -> wrong
  attention output. Mirror the C++ MatmulOpLowering: an operand is initialized by
  either a dma_start OR a preceding affine.vector_store into its root memref; bail
  only when an operand is truly uninitialized. BTag/BAsync stay None/0 and are only
  read under 'if BAsync:', so the B dma_wait is correctly skipped (as in C++).

- _make_sf_vc_v_iv n>1 transcendental chunking called
  vector.ExtractStridedSliceOp(offsets, sizes, strides, vec) -- wrong arg order,
  missing the result type and vector operand, raising TypeError under these MLIR
  bindings. Pass (result=legal_ty, vector=vec, offsets, sizes, strides). Only
  reached by large transcendentals (n>1), e.g. SDPA softmax exp, so CI's small-tile
  (n==1) tests never hit it.

Validated end-to-end (Spike+TOGSim allclose): SDPA 56 cases pass (was crash/wrong);
matmul/bmm/conv2d regress clean. Bisected: C++ vcix passes SDPA, Python vcix did
not; exp chunking and fine-grained ruled out separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@YWHyuk YWHyuk changed the title [Frontend] Fix floor/mod guard + decompose unit-dim/vlane-axis edge cases [Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp Jun 18, 2026
@YWHyuk YWHyuk merged commit 417e4f2 into feature/tog-python-binding Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant